EDA
In this section, we present our primary dataset and the supplementary datasets to give an overview of the data we are working with.
The goal is to explore how these data can be combined with the strategies and techniques identified in our literature review to profile syndemic relationships for type II diabetes.
Packages
Demographic Data
Because our research goal is to evaluate how social and demographic factors interact with diabetes in a syndemic relationship, it is important to understand the demographic breakdown of the group studied by the National Health and Nutrition Examination Survey (NHANES).
The demographic data set for the study includes 15560 observations of 29 variables including information on race, gender, family income, education level, and language spoken. Names and summaries of each of the variables are shown below.
Missing Values
The data set has 682 missing values for the age variable and 2201 missing values for the ratio_family_income_poverty variable.
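The per-variable missing counts reported above can be obtained with a short check like the following. This is a minimal sketch that assumes the demographics table is loaded as a pandas DataFrame; the toy values are made up, and only the column names come from this report.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the demographics table (toy values, not NHANES data).
demo = pd.DataFrame({
    "respondent_sequence_num": [1, 2, 3, 4, 5],
    "age": [34.0, np.nan, 12.0, 67.0, np.nan],
    "ratio_family_income_poverty": [1.2, 5.0, np.nan, 0.8, 3.1],
})

# Count missing values per column and keep only columns that have any.
missing_counts = demo.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

Run against the full table, the same two lines would list every variable with at least one missing value, ordered from most to least affected.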
Distribution of Continuous Variables
There are two continuous variables in the demographics data set: age and the ratio of family income to the poverty guideline (ratio_family_income_poverty).
The age distribution is skewed toward younger participants, with more participants under 20 years old than in any other age range.
Note:
The Department of Health and Human Services (HHS) poverty guidelines were used as the poverty measure to calculate this ratio. So, the ratio was calculated as:
Ratio = (Total Annual Income)/(Poverty Guideline specific to each year)
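As a worked illustration of the formula in the note, the sketch below computes the ratio for one household. The function name and the dollar amounts are hypothetical and are not actual HHS guideline values.

```python
def income_to_poverty_ratio(total_annual_income: float, poverty_guideline: float) -> float:
    """Ratio of family income to the poverty guideline for the survey year."""
    if poverty_guideline <= 0:
        raise ValueError("poverty_guideline must be positive")
    return total_annual_income / poverty_guideline


# Illustrative numbers only: a family earning twice its guideline amount.
ratio = income_to_poverty_ratio(60_000, 30_000)
print(ratio)  # 2.0
```

A ratio below 1 means the family income is under the poverty guideline, and a ratio of 2 means it is twice the guideline.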
More survey participants fall at five or more times the poverty line than at any other level.
Distribution of Categorical Variables
There are several categorical variables in the demographics data set that may have associations with higher rates of diabetes diagnosis.
The majority of survey participants were born in the United States.
The majority of survey participants have at least a high school diploma, but the numbers are fairly similar across all five levels of education.
Non-Hispanic Black and non-Hispanic White are the two most frequent race categories in the survey sample.
Gender and Education Stratified by Race
To understand how confounding variables may affect our analysis, it is important to compare the distributions of various factors such as gender and education level by other demographic factors such as race.
The gender distribution is fairly even across all races in the survey sample.
Education levels, on the other hand, differ across races. Therefore, education and race may act as confounders in analyses of the survey data.
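One way to make this kind of stratified comparison concrete is a row-normalized cross-tabulation, so that each race category sums to 1 and groups of different sizes can be compared directly. The sketch below assumes pandas; the column names `race` and `education_level` and the category labels are placeholders, not the actual variable names in the data set.

```python
import pandas as pd

# Toy data with placeholder column names and categories.
demo = pd.DataFrame({
    "race": ["A", "A", "A", "B", "B", "B", "B"],
    "education_level": [
        "High school", "College", "College",
        "High school", "High school", "High school", "College",
    ],
})

# Share of each education level within each race category (rows sum to 1).
edu_by_race = pd.crosstab(demo["race"], demo["education_level"], normalize="index")
print(edu_by_race)
```

Swapping `education_level` for a gender column would give the corresponding gender-by-race table.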
Diabetes Data
The diabetes data set from the National Health and Nutrition Examination Survey contains information on the diagnosis and progression of disease for each participant in the study. This dataset contains 28 variables, including when participants were diagnosed, whether or not they are on insulin, and how frequently they see a doctor. Names and summaries of each of the variables are shown below.
Missing Values
There are missing values in the age_informed, insulin_length, num_dr_visits_past_year, and how_often_glucose_check variables. These missing values likely correspond to participants who have not been informed of a diabetes diagnosis.
Distribution of Diagnostic Variables
The vast majority of survey participants have not been told that they have type II diabetes. To understand some characteristics of the participants who have been diagnosed, we filtered the data to include only these participants and looked at the distribution of some key variables.
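The filtering step described above might look like the following sketch. It assumes pandas and a diagnosis indicator named `told_have_diabetes` coded as "Yes"/"No"; that column name, its coding, and the age bins are hypothetical, while `age_informed` is one of the variables named in this report.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the diabetes table (not NHANES data).
diabetes = pd.DataFrame({
    "respondent_sequence_num": [1, 2, 3, 4, 5, 6],
    "told_have_diabetes": ["No", "Yes", "No", "Yes", "No", "Yes"],  # hypothetical column
    "age_informed": [np.nan, 45.0, np.nan, 62.0, np.nan, 38.0],
})

# Keep only participants who report a diagnosis.
diagnosed = diabetes[diabetes["told_have_diabetes"] == "Yes"]

# Bin age at diagnosis; intervals are closed on the left, e.g. [40, 70).
age_bins = pd.cut(
    diagnosed["age_informed"],
    bins=[0, 20, 40, 70, 120],
    labels=["<20", "20-39", "40-69", "70+"],
    right=False,
)
age_counts = age_bins.value_counts().sort_index()
print(age_counts)
```

Filtering first also removes most of the missing values noted earlier, since those fields are only populated for diagnosed participants.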
The majority of participants who have been diagnosed with type II diabetes were informed of this diagnosis between the ages of 40 and 70.
Among type II diabetics, about one third of participants were taking insulin at the time of the survey.
Health and Nutrition Data
The health and nutritional behavior data detail participants’ food choices, including breastfeeding and other childhood feeding practices, the frequency of getting meals prepared away from home, the frequency of getting meals from fast food or pizza places, the use of convenience foods, and knowledge of the MyPlate program. The data represent 15560 individuals with 46 different variables observed. Names and summaries of variables are shown below.
Column Names:
1. respondent_sequence_num
2. ever_breastfed_or_fed_breastmilk
3. age_stopped_breastfeeding_days
4. diet_healthiness
5. community_government_meals_delivered
6. eat_meals_at_community_senior_center
7. attend_kindergarten_thru_high_school
8. school_serves_school_lunches
9. school_serves_complete_breakfast_daily
10. summer_program_meal_free_reduced_price
11. meals_not_home_prepared_count
12. meals_from_fast_food_or_pizza_place_count
13. ready_to_eat_foods_past_30_days
14. frozen_meals_pizza_past_30_days
Data Types & Missing Values
Breastfeeding and Weaning
Table of respondents fed breast milk or breastfed:
Value Frequency Percentage
1 Yes 2066 78.73476
2 No 558 21.26524
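The percentages in this table follow directly from the frequencies. As a minimal sketch (assuming pandas, with the two counts copied from the table above), the calculation is:

```python
import pandas as pd

# Frequencies copied from the table above.
counts = pd.Series({"Yes": 2066, "No": 558}, name="Frequency")

table = counts.to_frame()
# Percentage of all respondents who answered the question.
table["Percentage"] = 100 * table["Frequency"] / table["Frequency"].sum()
print(table)
```

The same two lines can be reused to check any other frequency table in this section against its reported percentages.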
Summary statistics for age stopped breastfeeding (days):
Statistic Value
Mean 198.6769
Median 121
SD 218.0595
Min 5.397605e-79 (effectively 0)
Max 1095
Nutritional Practices
In the distribution of healthiness ratings we see that the most common rating for participants in the survey is “Good” while the ratings “Poor” and “Excellent” are the least common.
The counts of meals not prepared at home, fast food meals, and ready-to-eat meals all have similar right-skewed distributions: low values (0-1) are the most frequent and higher values are the least frequent.
Education
Table of respondents who attended kindergarten through high school:
Value Frequency Percentage
1 Yes 3849 83.63755
2 No 753 16.36245
Laboratory Data
The laboratory data come from 43 XPT files downloaded from the NHANES website. Because so many files are combined, the cleaned dataset contains 337 columns. Many of these variables are strongly correlated with each other, since some report the same measurement in a different unit. Given the number of files and the number of variables in each, we did not manually remove these highly correlated columns. In addition, joining the files on a common Respondent Sequence ID number introduces missing values in every row.
The cleaning process removed rows in which every column except the first is NaN, as well as columns that contain only one unique value. Below is a summary of the dataset, along with visualizations of a few of the many variables that we will consider in this project.
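The combine-then-clean steps described above can be sketched as follows. This assumes pandas, an outer join on the ID column, and that NaN is ignored when counting unique values; the two toy tables, the column names other than the ID, and the order of the two cleaning steps are all illustrative assumptions rather than the pipeline actually used.

```python
from functools import reduce

import numpy as np
import pandas as pd

ID = "respondent_sequence_num"

# Two toy "files" standing in for the 43 XPT files (hypothetical columns).
file_a = pd.DataFrame({ID: [1, 2, 3], "albumin_urine": [4.2, 7.9, np.nan]})
file_b = pd.DataFrame({
    ID: [2, 3, 4],
    "cholesterol_total": [180.0, 210.0, np.nan],
    "units_flag": [1.0, 1.0, np.nan],
})

# Outer join keeps every respondent, which is what introduces the NaNs.
merged = reduce(lambda left, right: left.merge(right, on=ID, how="outer"), [file_a, file_b])

# Drop rows where every column except the first (the ID) is NaN.
value_cols = merged.columns[1:]
cleaned = merged.dropna(subset=value_cols, how="all")

# Drop columns with at most one distinct non-missing value
# (assumption: this also removes columns that are entirely NaN).
constant_cols = [c for c in cleaned.columns if cleaned[c].nunique(dropna=True) <= 1]
cleaned = cleaned.drop(columns=constant_cols)
print(cleaned)
```

In this toy case respondent 4 is dropped because no test values exist for them, and `units_flag` is dropped because it never varies.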
We looked at the distributions of six key analytes measured in the laboratory data: albumin, creatinine, arsenic, triglycerides, cholesterol, and hemoglobin.
Albumin in Urine (ug/mL) Testing
Creatinine (mg/dL) Testing
Arsenic Total (ug/L) Testing
Triglyceride (mg/dL) Testing
Total Cholesterol (mg/dL) Testing
Hemoglobin (g/dL) Testing
Questionnaire Data
The NHANES questionnaire data set has over 40 different variables regarding social behaviours, employment status, mental health, physical health, insurance coverage, and more. To keep the analysis comprehensive but manageable, we selected five factors to explore further: alcohol consumption, depression, health access, insurance, and occupation. These factors interested us the most and/or were mentioned frequently in our review of the literature.
Each of these factors is a sub-data set with multiple variables. These variables can give a general overview of the topic, such as the alcohol data set’s question “have you ever consumed alcohol,” or be quite specific, such as the alcohol data set’s question “how many days have you consumed 12+ drinks in the past year.” We selected the variables that we believed were the most representative of each subject and could give us the best overview of the respondent’s behaviour without going into the specifics of each question.
We selected 1-4 variables per sub-topic, and these are the variables on which we performed EDA and survival analysis. Our selection is not to say that other variables are less important; rather, we want to focus on the variables that provide the most information.
Alcohol Data
For the alcohol data set, there were many questions about specific alcohol consumption behaviours, such as the questions mentioned above. We have selected two variables for this section: ever_had_a_drink_of_any_kind_of_alcohol and avg_alcoholic_drinks_per_day_past_12_months. We have cleaned the data to remove any outliers and then performed exploratory data analysis.
General Alcohol Consumption
The analysis of the respondents’ answer to “ever had a drink of any kind of alcohol” shows that 89.6% of respondents have consumed alcohol before, and 10.4% have not. We next want to explore how much those who have consumed alcohol tend to consume on average.
How Much Alcohol Consumed Per Day
We find that a majority of respondents report having between 1 and 3 drinks per day; these three groups (1, 2, and 3 drinks) account for approximately 80% of the data. Respondents reporting 4-6 drinks per day make up another 10%, and those reporting 7 or more drinks per day make up the remaining 10%.
Depression Data
For the depression data, we decided to explore all of the variables in the data set. Each question is phrased similarly, asking on how many days the respondent felt ___, and has the same set of answer choices: “not at all”, “several days”, and “more than half the days.”
For every question in the depression questionnaire, more than half of respondents answered “not at all.” The “feeling tired or having little energy” and “trouble sleeping or sleeping too much” questions saw higher proportions of “several days” and “more than half the days” responses. The next largest share of answers other than “not at all” was for “poor appetite or overeating”, and the remaining questions are all fairly similar.
Health Insurance Data
We decided to explore two variables from the health insurance data: whether or not the respondent is covered by insurance and, if so, what kind of insurance they have.
# A tibble: 4 × 3
`Covered by Insurance?` Count Proportion
<chr> <int> <dbl>
1 Yes 13671 0.879
2 No 1852 0.119
3 Don't know 29 0.00186
4 Refused 8 0.000514
We found that around 87.9% of the respondents were covered by insurance, and 11.9% were not.
# A tibble: 7 × 3
`Insurance Type` No Yes
<chr> <int> <int>
1 covered_by_chip 15389 171
2 covered_by_medi_gap 15462 98
3 covered_by_medicaid 11381 4179
4 covered_by_medicare 12968 2592
5 covered_by_other_government_insurance 14552 1008
6 covered_by_private_insurance 8457 7103
7 covered_by_state_sponsored_health_plan 14623 937
We also found that the most common type of insurance was private insurance, followed by Medicaid, Medicare, and other types of government insurance.
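A Yes/No count table of this shape can be built by reshaping the one-column-per-insurance-type indicators into long form and cross-tabulating. The sketch below assumes pandas and that each `covered_by_*` column is coded as the strings "Yes" and "No"; the coding and the toy values are assumptions, while the column names are taken from the table above.

```python
import pandas as pd

# Toy indicators for four respondents (not NHANES data).
insurance = pd.DataFrame({
    "covered_by_private_insurance": ["Yes", "No", "Yes", "No"],
    "covered_by_medicaid": ["No", "Yes", "No", "No"],
    "covered_by_medicare": ["No", "No", "Yes", "No"],
})

# Wide -> long: one row per (respondent, insurance type) pair.
long_form = insurance.melt(var_name="Insurance Type", value_name="Response")

# Count No/Yes responses per insurance type, most common coverage first.
type_counts = pd.crosstab(long_form["Insurance Type"], long_form["Response"])
type_counts = type_counts.sort_values("Yes", ascending=False)
print(type_counts)
```

Because respondents can hold more than one type of coverage, the "Yes" column is not expected to sum to the number of insured respondents.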
Access to Healthcare and Hospital Usage Data
For the access to health care and hospital usage data, we decided to look at two variables: general health condition, and whether or not respondents have a regular place to go for health care.
The most common self-reported general health condition was good, followed by very good and then excellent.
Most respondents also have a consistent place to go to for health care, such as an urgent care or primary care physician.
Occupation Data
Finally, for the occupation data set we wanted to explore how much people are working and what their job status is.
The graph shows that the vast majority of respondents work 35-40 hours per week, followed by 45-50 hours and then 40-45 hours.
A majority of respondents are working at a job or business; the next largest group is those who are out of work.
Social Meal Support
Many survey participants attend schools where lunch and breakfast are served daily. The majority of survey participants do not receive community or government meals, or free or reduced-price meals at summer programs.